Abstract

Two important parallel architecture types are the shared-memory
architectures and the message-passing architectures. In the past,
researchers working on the parallel implementations of production
systems have focused either on shared-memory multiprocessors or on
special-purpose architectures. Message-passing computers have not
been studied. The main reasons have been the large message-passing
latency (as large as a few milliseconds) and high message reception
overheads (several hundred microseconds) exhibited by the
first-generation message-passing computers. These overheads are too
large for the parallel implementation of production systems, where it
is necessary to exploit parallelism at a very fine granularity to
obtain significant speed-up (subtasks execute about 100 machine
instructions). However, recent advances in interconnection network
technology and processing node design have cut the network latency
and message reception overhead by 2-3 orders of magnitude, making
these computers much more interesting. In this paper we present
techniques for mapping production systems onto message-passing
computers. We show that using a concurrent distributed hash table
data structure, it is possible to exploit parallelism at a very fine
granularity and to obtain significant speed-ups from parallelism.

1.  Introduction 

Production systems (or rule-based systems) occupy a prominent
place in the field of AI. They have been used extensively in the
attempts to understand the nature of intelligence as well as to develop
expert systems spanning a wide variety of applications. Production
system programs, however, are computation intensive and run
slowly. This slows down research and limits the utility of these
systems. In this paper, we examine the suitability of message-passing
computers (MPCs) for exploiting parallelism to speed up the
execution of production systems.

To obtain significant speed-up from parallelism in production
systems it is necessary to exploit parallelism at a very fine
granularity. For example, the average number of instructions
executed by subtasks in the parallel implementation suggested in
[10] is only about 100. In the past, researchers have explored the use
of special-purpose architectures and shared-memory multiprocessors
to capture this fine-grained parallelism [10, 16, 17, 18, 11, 21].
However, the performance of MPCs for production systems has not
been analyzed. Considering MPCs is important, because MPCs
represent a major architectural and programming model in current
use. Previously, the communication delays in the MPCs made it
impossible to use them for exploiting fine-grained
parallelism. However, recent developments in the implementations
of MPCs [3] have reduced the communication delays and the
message processing overheads by 2-3 orders of magnitude. The
presence of these new-generation MPCs, such as the AMETEK-2010
[19], makes it interesting to consider MPCs for implementing
production systems.

This paper is organized as follows. Section 2 describes the OPS5
production system and the Rete matching algorithm used in
implementing it. Section 3 describes recent developments in the
MPCs and presents the assumptions about their execution times
which we will use in our analysis. Section 4 presents our scheme for
implementing OPS5 on the MPCs. We then evaluate its performance
and compare it with other parallel implementations of production
systems.


2.  Background
2.1.  OPS5

An OPS5 [2] production system is composed of a set of if-then
rules called productions that make up the production memory, and a
database of temporary assertions, called the working memory. The
individual assertions are called working memory elements (WMEs),
which are lists of attribute-value pairs. Each production consists of a
conjunction of condition elements (CEs) corresponding to the if part
of the rule (the left-hand side or LHS), and a set of actions
corresponding to the then part of the rule (the right-hand side or
RHS).

The CEs in a production consist of attribute-value tests, where
some attributes may contain variables as values. The attribute-value
tests of a CE must all be matched by a WME for the CE to match; the
variables in the condition element may match any value, but if the
variable occurs in more than one CE of a production, then all
occurrences of the variable must match identical values. When all
the CEs of a production are matched, the production is satisfied, and
an instantiation of the production (a list of WMEs that matched it) is
created and entered into the conflict set. The production system uses
a selection procedure called conflict-resolution to choose a
production from the conflict set, which is then fired. When a
production fires, the RHS actions associated with that production are
executed. The RHS actions can add, remove or modify WMEs, or
perform I/O.
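
The matching rule for condition elements can be illustrated with a
minimal Python sketch. The representation is an assumption made only
for illustration: WMEs are dictionaries, variables are strings that
start with "?", and the helper names match_ce and match_production are
hypothetical. It is not OPS5 syntax, and it performs a naive match
rather than the Rete match described in Section 2.2.

```python
from typing import Dict, List, Optional

WME = Dict[str, object]        # attribute -> value
Bindings = Dict[str, object]   # variable name -> bound value


def is_variable(value: object) -> bool:
    # Assumed convention: variables are strings that start with "?".
    return isinstance(value, str) and value.startswith("?")


def match_ce(ce: Dict[str, object], wme: WME,
             bindings: Bindings) -> Optional[Bindings]:
    """Match one CE against one WME; return extended bindings or None."""
    new_bindings = dict(bindings)
    for attr, test in ce.items():
        if attr not in wme:
            return None
        value = wme[attr]
        if is_variable(test):
            # A variable matches any value but must agree with earlier uses.
            if test in new_bindings and new_bindings[test] != value:
                return None
            new_bindings[test] = value
        elif test != value:
            return None
    return new_bindings


def match_production(ces: List[Dict[str, object]],
                     wmes: List[WME]) -> List[List[WME]]:
    """Return every instantiation: one WME per CE, bindings consistent."""
    results: List[List[WME]] = []

    def extend(index: int, bindings: Bindings, chosen: List[WME]) -> None:
        if index == len(ces):
            results.append(list(chosen))
            return
        for wme in wmes:
            extended = match_ce(ces[index], wme, bindings)
            if extended is not None:
                extend(index + 1, extended, chosen + [wme])

    extend(0, {}, [])
    return results


working_memory = [
    {"type": "block", "name": "b1", "color": "red"},
    {"type": "block", "name": "b2", "color": "blue"},
    {"type": "hand", "holds": "b2"},
]
# Two CEs sharing the variable ?x: a block named ?x and a hand holding ?x.
production = [{"type": "block", "name": "?x"},
              {"type": "hand", "holds": "?x"}]
instantiations = match_production(production, working_memory)
```

Only the block b2 together with the hand WME satisfies both CEs,
because the shared variable ?x must take the same value in each.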

The production system is executed by an interpreter that
repeatedly cycles through three steps: match, conflict-resolution, and
act. The matching procedure determines the set of satisfied
productions, the conflict-resolution procedure selects the highest
priority instantiation, and the act procedure executes its RHS.


2.2.  Rete

Rete [7] is a highly efficient match algorithm that is also suitable
for parallel implementations [9]. Rete gains its efficiency from two
optimizations. First, it exploits the fact that only a small fraction of
working memory changes each cycle by storing results of match from
previous cycles and using them in subsequent cycles. Second, it
exploits the commonality between CEs of productions, to reduce the
number of tests performed.

Rete uses a special kind of data-flow network compiled from the
LHSs of productions to perform match. The network is generated at
compile time, before the production system is actually run. The
entities that flow in this network are called tokens, which consist of a
tag, a list of WME time-tags, and a list of variable bindings. The tag
is either a + or a -, indicating the addition or deletion of a WME. The
list of WME time-tags identifies the data elements matching a
subsequence of CEs in the production. The list of variable bindings
associated with a token corresponds to the bindings created for
variables in those CEs that the system is trying to match or has
already matched.

There are primarily three types of nodes in the network which use
the tokens described above to perform match:

1.  Constant-test nodes: These are used to test the constant-
value attributes of the CEs and always appear in the top
part of the network. They take less than 10% of the time
spent in match.

2.  Memory nodes: These store the results of the match phase
from previous cycles as state. This state consists of a list
of the tokens that match a part of the LHS of the
associated production. This way only changes made to
the working memory by the most recent production firing
have to be processed every cycle.

3.  Two-input nodes: These test for joint satisfaction of CEs
in the LHS of a production. Both inputs of a two-input
node come from memory nodes. When a token arrives
from the left memory, i.e., on the left input of a two-input
node, it is compared to each token stored in the right
memory. All token pairs that have consistent variable
bindings are sent to the successors of the two-input node.
Similar action occurs when a token arrives from the right
memory. We refer to such an action as a node-activation.
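
The behavior of a two-input node and its two memories can be sketched
in Python as follows. This is an illustrative sketch, not the authors'
implementation: the Token layout follows the description above (tag,
time-tags, bindings), while the class and function names, and the
choice to delete stored tokens by matching time-tags, are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Token:
    tag: str                                   # "+" addition, "-" deletion
    timetags: Tuple[int, ...]                  # WME time-tags matched so far
    bindings: Tuple[Tuple[str, object], ...]   # (variable, value) pairs

    def binding_map(self) -> Dict[str, object]:
        return dict(self.bindings)


def consistent(a: Token, b: Token) -> bool:
    """True if every variable bound in both tokens has the same value."""
    left, right = a.binding_map(), b.binding_map()
    return all(right[v] == val for v, val in left.items() if v in right)


def join(tag: str, left: Token, right: Token) -> Token:
    """Build the successor token for a consistent left/right pair."""
    merged = {**left.binding_map(), **right.binding_map()}
    return Token(tag, left.timetags + right.timetags,
                 tuple(sorted(merged.items())))


@dataclass
class TwoInputNode:
    left_memory: List[Token] = field(default_factory=list)
    right_memory: List[Token] = field(default_factory=list)

    @staticmethod
    def _update(memory: List[Token], token: Token) -> None:
        if token.tag == "+":
            memory.append(token)
        else:
            # Assumed deletion rule: drop the stored token with equal time-tags.
            memory[:] = [t for t in memory if t.timetags != token.timetags]

    def left_activation(self, token: Token) -> List[Token]:
        self._update(self.left_memory, token)
        return [join(token.tag, token, other)
                for other in self.right_memory if consistent(token, other)]

    def right_activation(self, token: Token) -> List[Token]:
        self._update(self.right_memory, token)
        return [join(token.tag, other, token)
                for other in self.left_memory if consistent(other, token)]


node = TwoInputNode()
node.right_activation(Token("+", (7,), (("?x", "b2"),)))
hit = node.left_activation(Token("+", (3,), (("?x", "b2"),)))
miss = node.left_activation(Token("+", (4,), (("?x", "b1"),)))
```

Here hit contains one successor token carrying time-tags (3, 7), while
miss is empty because the bindings for ?x disagree.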

Figure 2-1 shows the Rete net for a production named P1.


3.  Message-Passing Computers and Assumptions
MPCs are MIMD computers based on the programming model of
concurrent processes communicating by message passing. There is
no global shared memory and hence communication between the
concurrent processes is explicit as in Hoare's CSP [12], though not
necessarily synchronous. The early MPCs such as the Cosmic
Cube [20] had a high network latency of about 2 milliseconds (ms)
and a high overhead of message handling of about 300
microseconds (µs). As a result, it was impossible to exploit
parallelism at the fine granularity of 50-100 µs as is necessary in
production systems.

Recent developments in MPCs such as wormhole routing [4] have
reduced the network latencies to 2-3 µs, and the use of special
processors such as the MDP (Message Driven Processor) [5] can
potentially reduce the message reception overhead by an order of
magnitude. With today's VLSI technology, it is possible to construct
MPCs with thousands of processing nodes and hundreds of
megabytes of memory [3]. Thus very fine grain parallelism can now
be exploited easily with the MPCs.

This raises the issue of whether production systems can be
implemented efficiently on the MPCs to give good speedups, which
we analyze in detail in this paper. For the purpose of this analysis,
we assume a 32-ary 2-cube architecture (1024 nodes), with a 4 MIPS
processor at each node similar to the MDP. The various times that
are required for our analysis are as follows. The latency of wormhole
routing is given by

$T_w = T_c (D + L/W)$

where

$T_c$  Channel delay, assumed to be 50 nanoseconds
(ns), as in [3].

$W$  Channel width, assumed to be 16 bits.

$L$  Length of the message in bits.

$D$  Distance, or number of hops traveled by the
message. If two processing nodes are selected at
random in a k-ary n-cube, then the number of hops is
$n(k^2 - 1)/(3k) \approx 22$ for our 32-ary 2-cube.

We assume that the MDP is driven by a 100 ns clock and that the
time to execute a send (broadcast) command is

$T_s = (5 + Q \cdot N)$ clock cycles,

where a message of $Q$ words is to be sent to $N$ sites [5]. The
overhead of receiving messages is assumed insignificant [5]. Thus
there are two delays associated with a message: $T_s$ in its
transmission, and $T_w$ in its communication.


4.  Mapping Rete on the MPC
In this section we describe our mapping of Rete on the MPCs. We
draw heavily from our previous work with the PSM implementations
of production systems on shared-memory multiprocessors [9, 10, 21].

One possible scheme for implementing OPS5 on the MPCs arises
from viewing Rete in an object-oriented manner, where the nodes of
Rete are objects and tokens are messages. This scheme maps a single
object (node) of Rete onto a single processor of the MPC. However,
there are two serious problems: (1) The mapping requires one
processor per node of the Rete net, and the processor utilization of
such a scheme is expected to be very low; (2) Often, the processing
of a WME change results in multiple activations of the same Rete
node, which in the above mapping would be processed sequentially
on the same PE, thus causing that PE to be a bottleneck.


Figure 4-1: A high-level view of the mapping on the MPCs.


To overcome the limitations of the above mapping, we propose an
alternative mapping, a high-level picture of which is shown in Figure
4-1. At the heart of this mapping is a concurrent distributed
hash-table [5] data structure that enables fine-grain exploitation of
concurrency. The details are described later in this section. As
shown in Figure 4-1, the parallel mapping consists of 1 control
processor, 4 constant-node processors, 4 conflict-set processors, and
the rest are match processors. The constant-test nodes of the Rete net
are divided into 4 parts and assigned to the constant-node processors.
The match processors perform the function of the rest of the Rete net.
The conflict-set processors perform conflict-resolution on the
instantiations sent to them. Subsequently, they send the best
instantiation to the control processor. The control processor is
responsible for performing conflict-resolution among the best
instantiations, evaluating the RHS and performing other functions of
the interpreter.

As mentioned in Section 2.2, most of the time in match is spent
performing two-input node activations. Hashing the contents of the
associated memory nodes, instead of storing them in linear lists,
reduces the number of comparisons performed during a node-
activation and thus improves the performance of Rete. One hash
table is used for all left memory nodes in the network and the other
for all right memory nodes. The hash function that is applied to the
tokens takes into account (1) the variable bindings tested for equality
at the two-input node, and (2) the unique node-identifier of the
destination two-input node. This permits quick detection of the
tokens that are likely to pass the equal variable tests.
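
A hash function of this kind might look like the following Python
sketch. The function name bucket_index, the table size, and the use of
zlib.crc32 are assumptions chosen for illustration; the paper does not
specify the actual hash function.

```python
import zlib
from typing import Sequence


def bucket_index(node_id: int, tested_values: Sequence[object],
                 num_buckets: int) -> int:
    """Hash a token by its destination node-id and the values of the
    variables that the two-input node tests for equality."""
    key = repr((node_id, tuple(tested_values))).encode("utf-8")
    # crc32 is used only because it is deterministic across runs; any
    # reasonable hash function would serve in this sketch.
    return zlib.crc32(key) % num_buckets


NUM_BUCKETS = 512  # hypothetical table size

# A left token and a right token headed for node 17 with ?x = "b2".
left_index = bucket_index(17, ["b2"], NUM_BUCKETS)
right_index = bucket_index(17, ["b2"], NUM_BUCKETS)
```

Because both tokens agree on the node-id and on the tested value, they
land at the same index in the left and right tables, so a node
activation only has to examine one bucket on each side.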

In our mapping, to allow the parallel processing of (1) tokens
destined to the same two-input node and (2) tokens destined to
different two-input nodes, the hash table buckets storing the tokens
are distributed among the rest of the processor array. In particular, a
small number of corresponding buckets from the left and right hash
tables are assigned to each processor pair in the array -- the left
buckets to the left processor and the right buckets to the right
processor. (Note that when processing a node activation, the left and
right buckets at only one index need to be accessed.) This mapping is
pictorially depicted in Figure 4-2. There is one restriction on the
communication with the processor-pair: it can only be done
through the left processor. Allowing communication with both left
and right processors can result in creation of duplicate tokens leading
to incorrect behavior, and it does not gain as much in concurrency.


A processor-pair together performs the activity of a single node
activation. Consider the case when a token corresponding to the
left-activation of a two-input node arrives at a processor-pair. The
left processor immediately transmits the token to the right processor.
The left processor then copies the token into a data-structure and adds
it to the appropriate hash-table bucket. Meanwhile, the right
processor compares the token with contents of the appropriate right
bucket to generate tokens required for successor activations.
The right processor then calculates the hash value for the newly
created tokens, and sends each token to the processor pair which
owns the buckets that it hashes to. The activities performed by the
individual processors of the processor pair are called micro-tasks, and
all the micro-tasks on the various processor pairs are performed in
parallel.
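
The division of a left activation into two micro-tasks can be sketched
sequentially in Python. The class ProcessorPair and its method names
are hypothetical, tokens are reduced to their variable bindings, and
the two micro-tasks run one after the other here, whereas on the MPC
they would run concurrently on the two processors of the pair.

```python
from collections import defaultdict
from typing import Dict, List

Token = Dict[str, object]  # variable bindings only, to keep the sketch short


class ProcessorPair:
    """Sequential stand-in for one left/right processor pair."""

    def __init__(self) -> None:
        self.left_buckets: Dict[int, List[Token]] = defaultdict(list)
        self.right_buckets: Dict[int, List[Token]] = defaultdict(list)

    def left_micro_task(self, index: int, token: Token) -> None:
        # Left processor: store the token in the left bucket.
        self.left_buckets[index].append(token)

    def right_micro_task(self, index: int, token: Token) -> List[Token]:
        # Right processor: compare with the right bucket, build successors.
        successors = []
        for other in self.right_buckets[index]:
            shared = token.keys() & other.keys()
            if all(token[v] == other[v] for v in shared):
                successors.append({**token, **other})
        return successors

    def on_left_token(self, index: int, token: Token) -> List[Token]:
        # All traffic enters through the left processor, which forwards the
        # token to the right processor before doing its own micro-task.
        self.left_micro_task(index, token)
        return self.right_micro_task(index, token)


pair = ProcessorPair()
pair.right_buckets[3].append({"?x": "b2", "?h": "hand1"})
successors = pair.on_left_token(3, {"?x": "b2", "?c": "blue"})
```

Each successor would then be hashed and sent to the processor pair
that owns its bucket; that routing step is omitted from the sketch.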

The performance of this scheme depends on the discriminability of
hashing. Two observations can be made in this respect:

1.  Hashing is based on equality tests in CEs and 90% of the
tests at two-input nodes are equality tests [9].

2.  The locks on the hash tables in the PSM implementations
have not been seen to be bottlenecks [10, 21].

Thus hashing is not expected to be a problem in general.
However, in certain production systems, a large number of two-input
nodes do not have any tests. For such nodes, various schemes as
proposed in [1] can be used to introduce discriminability into the
tokens generated. Furthermore, when the compiler does come across
nodes which cannot be hashed, it can assign a larger number of
processors for that pair of buckets (since all the tokens would end up
in a single pair of buckets), thus breaking up the processing.

The code for the Rete net is to be encoded in the OPS83 [8]
software technology. With this encoding, large OPS5 programs (with
about 1000 productions) require about 1-2 Mbytes of memory, which
is a problem, since each MPC processor has only 10-20 kbytes of local
memory. We therefore use two strategies to save space:

1.  Partition the nodes of Rete such that each processor
evaluates nodes from only one partition. This
partitioning is easily achieved if the hash function
preserves some bits from the node-id. To avoid
contention, nodes belonging to a single production are put
into different partitions.

2.  One cause of the large memory requirement is the in-line
expansion of procedures. We can instead encode the two-
input nodes into structures of 14 bytes, indexed by the
node-id. A small performance penalty of loading the
required information into registers is then paid.

The system's overall operation is as follows:

1.  The control processor evaluates a WME change and
transmits it to the constant-node processors.

2.  The constant-node processors match the WME with the
constants in the CEs. The result of this match is tokens
that have bindings for the variables in matched CEs.
These tokens represent individual node activations and
are sent to appropriate processor pairs.

3.  The following steps are then repeated by the processor-
pairs until completion of match:

•  Split the node-activation into micro-tasks and
perform them in parallel.

•  Count the number of successor tokens generated
due to this token; if no successors are generated,
then send an acknowledgement (ack) message to
this processor pair's activator.

•  Accept ack messages from the successors. If
all successors of a token are accounted for, send an ack
message to the activator.

Detecting termination in a distributed system is a complex
problem in itself [15]. The ack messages provide an easy and
reasonably efficient method of informing the conflict-set processors
about the completion of the match. Thus after the processing of the
last activation in the current match cycle, a single stream of ack
messages flows back, finally to the control processor, which then
informs the conflict-set processors that the match is completed.
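
The ack-based completion scheme can be simulated with a short Python
sketch. It is a sequential model under stated assumptions: the
Activation class, the successors_of table that stands in for the Rete
net, and the log of messages are all hypothetical, and real acks would
travel as messages between processor pairs.

```python
from typing import Dict, List, Optional


class Activation:
    def __init__(self, name: str, activator: Optional["Activation"]) -> None:
        self.name = name
        self.activator = activator
        self.pending = 0  # successors that have not yet acknowledged


def send_ack(activation: Activation, log: List[str]) -> None:
    """Ack the activator; it acks in turn once all its successors have."""
    parent = activation.activator
    if parent is None:
        # The root models the control processor: the match is complete.
        log.append("match complete")
        return
    log.append(f"ack {activation.name} -> {parent.name}")
    parent.pending -= 1
    if parent.pending == 0:
        send_ack(parent, log)


def process(activation: Activation, successors_of: Dict[str, List[str]],
            log: List[str]) -> None:
    """Process one activation, count its successors, and ack if none."""
    children = successors_of.get(activation.name, [])
    activation.pending = len(children)
    if not children:
        send_ack(activation, log)
        return
    for child_name in children:
        process(Activation(child_name, activation), successors_of, log)


# Hypothetical activation tree: control -> a -> (b -> d, c).
successors_of = {"control": ["a"], "a": ["b", "c"], "b": ["d"]}
log: List[str] = []
process(Activation("control", None), successors_of, log)
```

The log ends with a single "match complete" entry, emitted only after
every activation in the tree has been acknowledged.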


5.  Performance  Analysis 

We now evaluate the MPC implementation using the
measurements on the Rete net from [9].¹ The point of the analysis is
to establish that the MPCs will provide good speedups compared to
other previously proposed parallel implementations, rather than to
estimate the exact performance that will be obtained on a real
machine.

One of the important numbers for this analysis is the time spent in
the processing of one node activation. Using that, we can estimate the
time for a micro-task. A node activation is identical to a task on the
PSM, which takes 200 µs on a 1 MIPS processor [10].
Measurements of the number of instructions executed indicate that
about 50% of that time is spent in updating the hash bucket and 50%
in performing tests with tokens in opposite memory. We therefore
assume that on our 4 MIPS processor, performing a micro-task will
take about 25 µs, which is 200 µs * 1/4 (due to processor speed) * 0.5
(due to partitioning of the node-activation into micro-tasks).

Since the processor-pairs communicate via tokens, we also need to
calculate the overhead of a token message. The length of a token-
message is dependent on the number of variable bindings and the
number of WME timetags carried by the token. There are on average
four variable bindings per production [9]. The number of WME
timetags is dependent on the number of CEs in a production.
Assuming the number of CEs to be (M = 5) for the moment, we use
the token-structure in Figure 4-2 to estimate 42 bytes of information
per token. The overhead of sending the token message will therefore
be equal to $T_s = (5 + Q \cdot N)$ clock cycles, with $Q = 42/4$ words
and $N = 1$ processor (see Section 3). Substituting, we get
$T_s \approx 1.6$ µs. The communication delay $T_w$ is given by
$T_c (D + L/W)$. This communication will be between a random pair of
processors. Therefore, $D \approx 22$. We have assumed $T_c$ to be
50 ns and $W$ to be 16. Our $L$ is 42 * 8 = 336 bits. Substituting, we
get $T_w = 2.2$ µs. The total delay will therefore be 1.6 + 2.2 = 3.8 µs
per token message between processor-pairs.
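
The token-message cost above can be checked with a few lines of
Python. The function and constant names are illustrative choices; the
formulas and parameter values are those given in the text.

```python
CLOCK_NS = 100.0           # MDP clock period
CHANNEL_DELAY_NS = 50.0    # T_c
CHANNEL_WIDTH_BITS = 16    # W
WORD_BYTES = 4


def send_overhead_us(words: float, sites: int) -> float:
    """T_s = (5 + Q * N) clock cycles, converted to microseconds."""
    return (5 + words * sites) * CLOCK_NS / 1000.0


def wormhole_latency_us(hops: float, length_bits: float) -> float:
    """T_w = T_c * (D + L / W), converted to microseconds."""
    return CHANNEL_DELAY_NS * (hops + length_bits / CHANNEL_WIDTH_BITS) / 1000.0


TOKEN_BYTES = 42
t_s = send_overhead_us(TOKEN_BYTES / WORD_BYTES, 1)   # Q = 10.5 words, N = 1
t_w = wormhole_latency_us(22, TOKEN_BYTES * 8)        # D = 22, L = 336 bits
token_message_us = t_s + t_w
```

Unrounded, the two terms are 1.55 µs and 2.15 µs. The text rounds each
term up before adding, which gives 3.8 µs rather than the unrounded
sum of 3.7 µs.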

We can now estimate the cost of one match cycle. The steps
below correspond to the algorithm in the previous section.

Step 1: The WME changes are transmitted to the 4 constant-node
processors. The cost of addition of a WME is as follows. The
average WME consists of 24 attribute-value pairs, which can be
encoded in 24 bytes for attributes + 24 words for the values = 30
words. Broadcasting this WME takes $T_s$ = (5 + 30 words * 4
processors) clock cycles, i.e., 12.5 µs.

For the communication delay, $T_w$, $D = 1$ since the constant-node
processors are one hop away from the control processor. The value
of $L$ is 30 words * 32 bits/word = 960 bits; $W = 16$ and the value of
$T_c$ is fixed at 50 ns. Substituting, we get $T_w \approx 3.1$ µs. Thus
the total time spent in communication during WME-addition is 15.6 µs.

For deleting a WME, only the timetag of the WME to be deleted is
passed on to the constant-node processors. Calculating $T_s$ and $T_w$
in a similar fashion, we get the total time spent in delete to be 1.1 µs.
There is an average of 2.5 WME changes per cycle. Assuming equal
proportions of adds and deletes, the cost of the first step is
1.25 * (1.1 + 15.6) ≈ 21 µs.
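
The Step 1 arithmetic can be reproduced with the following Python
sketch. The helper names are illustrative, and treating the delete
message as a single 32-bit word is an assumption; it is consistent
with the 1.1 µs figure in the text but is not stated there.

```python
def send_overhead_us(words: float, sites: int, clock_ns: float = 100.0) -> float:
    """T_s = (5 + Q * N) clock cycles, in microseconds."""
    return (5 + words * sites) * clock_ns / 1000.0


def wormhole_latency_us(hops: float, length_bits: float,
                        channel_delay_ns: float = 50.0,
                        width_bits: int = 16) -> float:
    """T_w = T_c * (D + L / W), in microseconds."""
    return channel_delay_ns * (hops + length_bits / width_bits) / 1000.0


CONSTANT_NODE_PROCESSORS = 4
HOPS_TO_CONSTANT_NODES = 1

# Adding a WME: 30 words broadcast to the 4 constant-node processors.
add_us = (send_overhead_us(30, CONSTANT_NODE_PROCESSORS)
          + wormhole_latency_us(HOPS_TO_CONSTANT_NODES, 30 * 32))

# Deleting a WME: only the timetag, assumed here to be one 32-bit word.
delete_us = (send_overhead_us(1, CONSTANT_NODE_PROCESSORS)
             + wormhole_latency_us(HOPS_TO_CONSTANT_NODES, 32))

# 2.5 WME changes per cycle, split equally between adds and deletes.
step1_us = 1.25 * (add_us + delete_us)
```

Without intermediate rounding the sketch gives 15.55 µs per add,
1.05 µs per delete, and 20.75 µs for the step, which agrees with the
figure of about 21 µs.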


¹We do not analyze the conflict-resolution and action parts of the match cycle, since those take
less than 10% of the time in a serial implementation. Since we have divided up the conflict
set and pipelined the action part with the match, these should take even less time than that.
In case they do become bottlenecks, various schemes discussed in [9] can be used to reduce
their overheads.


Step 2: The constant tests are now evaluated. Assuming that the
constant tests are implemented via hashing, there are 20 constant-
node activations per WME change [9]. On average, each partition
will have 5 activations per WME change. Thus about (5 * 2 / 4
MIPS) = 2.5 µs are spent in matching the constant nodes. A token
structure is then generated and bindings are created for the variable(s)
in the CEs which passed the tests. Measurements [9] show that there
will be about 5-7 such tokens generated per WME change, which we
assume to take 20 µs. This whole operation of processing a WME-
change by a constant-node processor is therefore estimated to take
about 22.5 µs. For the 2.5 WME-changes, (22.5 * 2.5) ≈ 56 µs will
be spent in processing the constant nodes and generating the initial
tokens in a cycle. The generation of these tokens is pipelined with
sending the tokens to the match processors.

Step 3: The processor-pairs perform the rest of the match. The
node-activations typically go to different processor-pairs, and are
processed in parallel. Therefore, the total time to finish the match is
determined by the longest chain of dependent node-activations, since
the micro-tasks in the chain have to be processed sequentially. On
average, the chain will be generated after 50% of the initial tokens in
a cycle have been generated. A constant-node processor takes 56 µs
to generate all the initial tokens; therefore, we assume that the initial
token generating the long chain will be created after 28 µs. Including
the constant-node processors, let the longest chain be of length M =
5.

When a token arrives at the left processor, it is immediately
transmitted to the right processor. For this transmission, $T_s$ is still
1.6 µs. But $T_w$ = 50 ns * (1 channel + 42 * 8/16) = 1.1 µs. Thus,
after a token arrives at the left processor, it will take 1.6 + 1.1 = 2.7 µs
to reach the right processor. The right processor will take 25 µs to
finish the micro-task. It will then take 3.8 µs for the successor token to
reach its destination. Thus, the time to complete a micro-task is 25 +
2.7 + 3.8 = 31.5 µs. A chain of length 5 will therefore take 31.5 * 4 +
28 µs (due to the constant nodes) = 154 µs. (Similar analysis could
be done if the successors are generated by the left processor.)

The ack messages are propagated back through the node activation
chain, after the last activation is processed. An ack is 1 word of
information and so we estimate $T_w = 1.2$ µs and $T_s = 0.6$ µs.
Assuming that the ack is processed in 1 µs, the time spent in the
chain of ack messages is (M = 5) * (1 + 1.2 + 0.6) = 14.0 µs. Adding
all the numbers together, we get the time for the MPC to match to be
approximately 154 + 14 + 21 = 189 µs.
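
The cycle-time estimate can be assembled from its parts in a short
Python sketch. The function name and the returned dictionary are
illustrative; the constants are the rounded figures used in the text.

```python
from typing import Dict

MICRO_TASK_US = 25.0            # one micro-task on a 4 MIPS processor
LEFT_TO_RIGHT_US = 1.6 + 1.1    # token forwarded inside a processor pair
SUCCESSOR_MSG_US = 3.8          # successor token sent to another pair
INITIAL_TOKEN_DELAY_US = 28.0   # half of the 56 us spent on constant nodes
ACK_HOP_US = 1.0 + 1.2 + 0.6    # ack processing + T_w + T_s
STEP1_US = 21.0                 # broadcasting the WME changes


def match_time_breakdown(chain_length: int) -> Dict[str, float]:
    """Estimated match time for a longest chain of the given length.

    The chain length includes the constant-node stage, so only
    chain_length - 1 micro-tasks are charged at the processor pairs.
    """
    per_link = MICRO_TASK_US + LEFT_TO_RIGHT_US + SUCCESSOR_MSG_US
    chain = per_link * (chain_length - 1) + INITIAL_TOKEN_DELAY_US
    acks = chain_length * ACK_HOP_US
    return {"chain": chain, "acks": acks, "step1": STEP1_US,
            "total": chain + acks + STEP1_US}


breakdown = match_time_breakdown(5)
```

For M = 5 the parts come to about 154 µs for the chain, 14 µs for the
acks, and 21 µs for Step 1, for a total of about 189 µs.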

A production system generates 200 micro-tasks on average per
cycle, and therefore a uniprocessor will take 200 * 25 = 5000
µs per cycle. Using this we get about 26-fold speedup for the above
system with the longest chain of M = 5. This is about 60% of the
maximum parallelism exploitable on an ideal multiprocessor at this
granularity. Our calculations show that the speedup is about 14-fold if
M = 10 and about 9-fold if M = 15. Again, this is about 60% of the
maximum available parallelism. This is comparable with the estimate
of 60% exploitable parallelism in shared-memory multiprocessors at
the node-activation level [9]. This coarser grain node-activation level
parallelism can be exploited on the MPCs by allocating both the left
and right buckets to one processor. Our calculations show that the
micro-task based scheme is capable of exploiting 1.5 times more
speedup than a scheme to exploit the node-activation level
parallelism.
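
The quoted speedups follow directly from the cycle-time model, as this
Python sketch shows. The function names are illustrative; the
constants are the rounded per-step costs derived in this section.

```python
def match_time_us(chain_length: int) -> float:
    """Match time: dependent chain + ack chain + WME broadcast (Step 1)."""
    chain = 31.5 * (chain_length - 1) + 28.0   # micro-tasks and messages
    acks = 2.8 * chain_length                  # one ack hop per chain element
    return chain + acks + 21.0


def speedup(chain_length: int, micro_tasks: int = 200,
            micro_task_us: float = 25.0) -> float:
    """Uniprocessor time per cycle divided by the estimated MPC match time."""
    return micro_tasks * micro_task_us / match_time_us(chain_length)


speedups = {m: speedup(m) for m in (5, 10, 15)}
```

The resulting values are roughly 26.5, 13.9 and 9.4 for M = 5, 10 and
15, which round to the 26-, 14- and 9-fold figures in the text.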


6.  Discussion 

Comparing the MPC implementation to a shared-memory multi-
processor implementation, we see that the principal advantage of the
MPC implementation is the absence of a centralized task-scheduler,
which can be a potential bottleneck. As shown in [9], in shared-
memory implementations, a slow scheduler forces saturation speedup
with a relatively small number of processors, irrespective of the
inherent parallelism in the system. However, the MPC
implementation suffers from a static partitioning of the hash tables. It
is possible that distinct tokens, which could potentially be processed
in parallel, are processed sequentially because they hash to the same
processor pair. Such a possibility does not arise in the shared-
memory implementation, since the size of the hash table is
independent of the number of processors.

Another tradeoff to be considered is between processor utilization
and the number of processors. With a higher number of processors,
the processor utilization will be low, but the message contention in
the network will be reduced. As the number of processors is reduced,
processor utilization will be improved; but again, this will also
increase the hash table contention. Thus there are some interesting
tradeoffs involved in moving towards the MPCs.

A mapping similar to the one proposed in this paper has been used to
implement production systems on the simulator for Nectar, a network
computer architecture with low message passing latencies [13].
These simulations show that good speedups can be obtained
implementing production systems on MPCs with low latencies [22].
The simulations also indicate that the constant-node processors can
quickly become bottlenecks if the initial tokens are not generated and
sent fast enough. In our current implementation, we have hashed the
constant nodes to take care of such a possibility. If the constant-node
processors continue to be bottlenecks in spite of this, then schemes
proposed in [22] can be used to remove them.

Finally, we would like to reiterate the importance of mapping
production systems on MPCs. Current production systems offer
limited (10-20 fold) parallelism [9]. We have shown that the MPCs
are capable of exploiting this limited parallelism. However,
production systems with more inherent parallelism are getting
designed [14]. In such production systems, the parallelism is
expected to be much higher [21]. For such production systems, it
becomes necessary to analyze easily scalable architectures such as
the MPCs for their implementations.


7.  Summary 

Recent advances in interconnection network technology and
processing node design have reduced the latency and message
handling overheads in MPCs to a few microseconds. In this paper we
addressed the issue of efficiently implementing production systems
on these new-generation MPCs. We conclude that it is indeed quite
possible to implement production systems efficiently on MPCs. At a
high level, our mapping corresponds to an object-oriented system,
with Rete network nodes passing tokens to each other using
messages. At a lower level, however, instead of mapping each Rete
node onto a single processor, the state and the code associated with a
node are distributed among the multiple processors. The main data
structure that we exploit in our mapping is a concurrent distributed
hash-table that not only allows activations of distinct Rete nodes to
be processed in parallel, but also allows multiple activations of the
same node to be processed in parallel. A single node activation is
further split into two micro-tasks that are processed in parallel,
resulting in very high expected performance.